Error correction translation using text corpora
نویسندگان
چکیده
In this paper, we propose an error correction method using text corpora. In this method, recognition errors are corrected using phonetically similar examples in the text corpora. The reliability of the correction hypotheses are judged according to their semantic consistency and their phonetic similarity to the original input. We previously proposed an error correction method that uses a treebank [1]. However, the previous method was not flexible in its use of examples, because structural mismatches occurred between the input and examples due to recognition errors. In our new proposal, examples are treated as morpheme sequences. This enables us to use examples partially when there are no useful full-sentence-examples. We built our proposed method into a speech translation system and compared the translation quality for simple translation and translation with error correction. The rate of acceptable translation increased about 10% with our proposed method compared to simple translation.
منابع مشابه
The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
English as a Second Language (ESL) learners’ writings contain various grammatical errors. Previous research on automatic error correction for ESL learners’ grammatical errors deals with restricted types of learners’ errors. Some types of errors can be corrected by rules using heuristics, while others are difficult to correct without statistical models using native corpora and/or learner corpora...
متن کاملConstrained Grammatical Error Correction using Statistical Machine Translation
This paper describes our use of phrasebased statistical machine translation (PBSMT) for the automatic correction of errors in learner text in our submission to the CoNLL 2013 Shared Task on Grammatical Error Correction. Since the limited training data provided for the task was insufficient for training an effective SMT system, we also explored alternative ways of generating pairs of incorrect a...
متن کاملA new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...
متن کاملاستخراج پیکره موازی از اسناد قابلمقایسه برای بهبود کیفیت ترجمه در سیستمهای ترجمه ماشینی
Data used for training statistical machine translation method are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation but due to lack of these kinds of texts, non-parallel and comparable corpora are used either for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...
متن کاملNAIST at 2013 CoNLL Grammatical Error Correction Shared Task
This paper describes the Nara Institute of Science and Technology (NAIST) error correction system in the CoNLL 2013 Shared Task. We constructed three systems: a system based on the Treelet Language Model for verb form and subjectverb agreement errors; a classifier trained on both learner and native corpora for noun number errors; a statistical machine translation (SMT)-based model for prepositi...
متن کامل